Detecting Data Theft Using Stochastic Forensics 

Jonathan Grier 

Abstract 

We present a method to examine a filesystem and determine if and when files were copied from
it. We develop this method by stochastically modeling filesystem behavior under both routine activity
and copying, and identifying emergent patterns in MAC timestamps unique to copying. These
patterns are detectable even months afterwards. We have successfully used this method to investigate
data exfiltration in the field. Our method presents a new approach to forensics: by looking for
stochastically emergent patterns, we can detect silent activities that lack artifacts.


1 Background 

Theft of corporate proprietary information, according to the FBI and CSI, has repeatedly been the most 
financially harmful category of computer crime (CSI & FBI, 2003). Insider data theft is especially 
difficult to detect, since the thief often has the technical authority to access the information (Yu & Chiueh, 
2004; Hillstrom & Hillstrom, 2002). Frustratingly, despite the need, no reliable method of forensically 
determining if files have been copied has been developed (Carvey, 2009, p. 217). Methods do exist to 
detect particular actions often associated with copying, such as attaching a removable USB drive (Carvey, 
2009; Carvey & Altheide, 2005). Methods also exist that can detect copying when given a network trace
of the activity (Liu et al., 2009), or when given the media to which the files were copied (Chow et al.,
2007). However, no method has yet been discovered that, given only a filesystem, can determine if its
files were copied. Carvey (2009, p. 217) summarizes this problem: “there are no apparent artifacts
of this process [of copying data]... Artifacts of a copy operation... are not recorded in the Registry, or
within the file system, as far as I and others have been able to determine.”

In this paper, we develop a method to do exactly that: analyze a filesystem to determine if and when its 
files were copied. We report on the foundations of our method (Section 3), simulated trials (Section 4), 
its mathematical basis (Section 5), and usage in the field (Section 6). 

2 Can we use MAC timestamps? 

Farmer and Venema’s seminal work (Farmer, 2000; Venema, 2000; Farmer & Venema, 2004) describes 
reconstructing system activity via MAC timestamps. MAC timestamps are filesystem metadata which 
record a file’s most recent Modification, Access, and Creation times. By plotting these on a timeline, 
investigators can reconstruct filesystem activity, and hence computer usage, of a particular time. An 
investigator can also plot a histogram of filesystem activity, showing amount of activity per time period 
(Casey, 2004). 
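
As an illustration of the histogram approach, here is a minimal sketch that assumes access timestamps have already been extracted as Unix epoch seconds. The function name `activity_histogram` and the one-day bucket width are illustrative choices, not part of the cited methods.

```python
from collections import Counter

SECONDS_PER_DAY = 86_400

def activity_histogram(access_times, bucket_seconds=SECONDS_PER_DAY):
    """Count timestamps per fixed-width time bucket.

    access_times: iterable of Unix epoch seconds (one per file or folder).
    Returns a dict mapping bucket start time -> number of timestamps in it.
    """
    counts = Counter()
    for ts in access_times:
        bucket_start = int(ts // bucket_seconds) * bucket_seconds
        counts[bucket_start] += 1
    return dict(sorted(counts.items()))

# Hypothetical timestamps: three fall in one day-wide bucket, one in the next.
sample = [1_000_000, 1_000_500, 1_030_000, 1_100_000]
for start, n in activity_histogram(sample).items():
    print(start, "#" * n)
```

Such a histogram shows how much activity occurred in each period, but by itself says nothing about what kind of access produced it.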

Seemingly, we should be able to use MAC timestamps to detect data exfiltration. However, as mentioned 
above, the standard methods of MAC timestamp analysis fail to do this. Neither timelines nor histograms 
can distinguish copying from other forms of file access. 

Moreover, Microsoft Windows NTFS systems do not update a file’s access timestamp when it is copied.
Unlike Unix based systems, which implement copy commands in user code via standard reads of the
source file and writes to the destination file (Sun Microsystems, Inc., 2009a,b; Free Software
Foundation, Inc., 2010), Windows provides a dedicated CopyFile() system operation (Microsoft Corporation,
2010a). Thus, Unix based filesystems do not distinguish copying a file from other forms of accessing
it; both are done via read(), and both update the file’s access timestamp. (This was experimentally
confirmed using the cp command on a Linux 2.6.25 ext3 system.) Windows, however, distinguishes
between the two at the system level. Our experiments (performed on a Microsoft Windows XP Professional
5.1.2600 system) confirm that Windows indeed does not update the access timestamp of the source file
when copying it, making file copying seemingly invisible.


3 Emergent patterns caused by copying 


To be able to detect copying, we must refine our model of its filesystem activity. For the rest of this 
paper, we concern ourselves with the copying of an entire folder with numerous subfolders and files; we 
believe this to be the typical form of data exfiltration. 

We can distinguish between the access pattern of copying and that of routine access. Routine file access is 
selective: individual files and folders are opened while others are ignored. It is also temporally irregular: 
files are accessed in response to user or system activity, followed by a lull in access until the next activity 
causes new file access. Copying of folders, however, is nonselective: every file and subfolder within 
the folder is copied. It is furthermore temporally continuous: files are copied sequentially without pause 
until the entire operation is complete. Copying folders is also recursive: copying one folder invokes 
the copying of all subfolders, which each invoke copying of their subfolders, and so on, while routine 
activity is randomly ordered (see Table 1). 

This recursive nature of copying results in an additional trait. To copy a folder, the system must
enumerate the folder’s contents. Modern filesystems implement folders as special types of files called
directories; to enumerate a folder’s contents, the system accesses and reads the directory file. Thus,
copying will invariably access a directory before accessing its files and subfolders. What’s more, since
this is a data read and not a file copy, Windows NTFS does update the access time of the directory when
its contents are enumerated. Our experiments confirmed that on both the above Windows and Linux systems,
copying a folder updates the access time of the folder’s directory and all subdirectories.


Copying Folders                                         | Routine Access
--------------------------------------------------------|-------------------------------------------
Nonselective (all subfolders and files accessed)        | Selective
Temporally Continuous                                   | Temporally Irregular
Recursive                                               | Random Order
Directory accessed before its files                     | Files may be accessed without directory
On Windows: Directory timestamps updated, but not file  | Both directory and file timestamps updated


Table 1: Differences in access timestamp updates between copying folders and routine activity


Thus, although, as stated above, copying creates no individual artifact, it does create distinct emergent 
patterns. A filesystem examined immediately after copying occurs will show the five characteristics 
enumerated in Table 1. 

However, we cannot yet apply this technique in the field: MAC timestamps, notorious for being quickly 
overwritten, are unreliable. And other types of recursive access besides copying may also cause such 
emergent patterns. We address these problems in Section 4 and Section 7.

[Figure 1 image not reproduced. Panel titles: "Access Timestamp Updates for: Copying a Folder" and
"Routine Access".]

Figure 1: The left side shows the access timestamp updates that would occur upon copying folder
Project Aurora. Updates that would occur on Linux but not on Windows are shown in gray. The right side
shows updates that might occur during typical user activity, which, unlike folder copying, is selective,
temporally irregular, and randomly ordered (see Table 1).


4 Digging for footprints 


Although we have identified distinct emergent patterns caused by copying, we should be skeptical about 
using them in real world investigations. Timestamps are notoriously ephemeral: like footprints, they 
are swept away by newer activity (Farmer & Venema, 2004). If an investigation is performed weeks or 
months after the data theft, do we have any hope of unearthing these emergent patterns in timestamps? 

Surprisingly, the answer is yes: we can indeed detect them even months after the copying, and even 
when the date of the alleged copying is unknown. To do so, we must make two observations: First, 
while normal system activity (ignoring things like intentional tampering or resetting the system clock) 
can increase access timestamps to more recent times, it cannot decrease them. Thus, although access
timestamps are extremely volatile (as each access overwrites the previous timestamp), they nonetheless
maintain an invariant: they are monotonically non-decreasing.

Second, filesystem activity is by no means uniformly, or even normally, distributed over files. Activity
more closely resembles heavy-tailed distributions, such as a Pareto distribution (Wikipedia, 2010): a
small number of files generally account for a large portion of activity, with a significant number of
files undergoing negligible activity (Vogels, 1999; Gribble et al., 1998; Ferguson, 2002). Farmer and
Venema (2004, p. 4) report that over periods as long as a year, the majority of files on a typical
server are not accessed at all.

Consequently, if a folder was copied, we can expect to find the following, even if several weeks or months 
have elapsed since the time of copying: 


• Neither the copied folder, nor any of its subfolders, have access timestamps less than the time of
copying.

• A large number of these folders have access timestamps equal to the time of copying.

• On Windows, file timestamps will not resemble folders’ timestamps. Specifically, many files will
have access timestamps before any of the folders.


[Figure 2 image not reproduced. Surviving panel labels: "FolderB (not copied)", "Residual activity
visible", "No cutoff cluster".]

Figure 2: Access timestamps of two folders after 300 days of simulated activity. Note that the cutoff
cluster of Folder A, caused by its copying on day 200, is clearly visible even 100 days later, even with
a large amount of subsequent activity.


Copying thus creates an artifact which we call a cutoff cluster: a point in time before which no
subfolder has an access timestamp (hence a cutoff), and at which a disproportionate number of subfolders
have their access timestamp (hence a cluster). We generally expect a folder to have a number of rarely
accessed subfolders, which cause the cutoff cluster to remain detectable for several weeks or months (or 
until the next act of copying). Conversely, in the absence of copying (or other nonselective, recursive 
access), we expect to find some folders with access timestamps extending far back in time, consistent 
with a heavy-tailed distribution. 

To explore this, we simulated a model filesystem containing two folders, FolderA and FolderB. Each 
folder has 1000 children (files or subfolders), created at the start of the simulation, and is accessed 
approximately 100 times a day. File access is distributed randomly using a Pareto distribution. FolderA 
is copied 200 days after the start of the simulation; FolderB is never copied. After 300 days of simulation, 
we tabulated the date of most recent access for each file; that is, the files’ access timestamps. Both folders 
had more than half of their files accessed within the final two weeks of the simulation. Nonetheless, we 
are able to identify a clear cutoff cluster occurring at the time of copying of FolderA (see Figure 2). (An 
interactive version of the simulator is available at http://www.vesaria.com/datatheft/sim/ .)
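
The simulation described above can be sketched roughly as follows. This is not the authors' simulator: the Pareto shape parameter `alpha`, the per-child popularity weights, and the seeds are assumptions chosen for illustration, and every child is treated as a subfolder whose timestamp is updated by copying.

```python
import random

def simulate(n_children=1000, accesses_per_day=100, days=300,
             copy_day=None, alpha=1.0, seed=0):
    """Return each child's last-access day after `days` of simulated activity.

    Each child gets a Pareto-distributed popularity weight, so a few children
    receive most accesses. If copy_day is given, every child is touched on
    that day, modelling a nonselective recursive copy.
    """
    rng = random.Random(seed)
    weights = [rng.paretovariate(alpha) for _ in range(n_children)]
    last_access = [0] * n_children          # all children created on day 0
    for day in range(1, days + 1):
        for i in rng.choices(range(n_children), weights=weights,
                             k=accesses_per_day):
            last_access[i] = day            # timestamps only move forward
        if day == copy_day:
            last_access = [day] * n_children
    return last_access

folder_a = simulate(copy_day=200, seed=1)   # copied on day 200
folder_b = simulate(copy_day=None, seed=2)  # never copied

print("FolderA oldest timestamp:", min(folder_a))
print("FolderA children still at day 200:", folder_a.count(200))
print("FolderB oldest timestamp:", min(folder_b))
```

By construction, no child of the copied folder can end the run with a timestamp earlier than the copy day; how many remain exactly at the copy day depends on the assumed weights and on how much activity follows.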

Note that, on Windows, a cutoff cluster is invisible unless we first filter the histogram to include only
subfolders (and not files).

In short, if a large folder is copied, it will result in a cutoff cluster; this cutoff cluster can be detected 
months after the date of copying, even with low resolution timestamps and substantial amounts of noise. 


5 Quantitative analysis of cutoff clusters 


We now proceed to define metrics allowing us to quantitatively detect and measure cutoff clusters. For 
the remainder of this paper, we concern ourselves with systems such as Windows NTFS, which do not 
update file access timestamps on copy. Modifying our method for use with systems such as Linux ext3, 
which do update file access timestamps on copying, is straightforward. 

We use the conventional model of a filesystem as a tree, with each subfolder a child of its parent folder.
For each folder $f$, we define

\[
D(f) = \{f\} \cup \{x \mid x \text{ is a descendant folder of } f\}.
\]

That is, $D(f)$ is the set of $f$ and all of its descendant folders. Note that only folders, and not
files, are members of $D(f)$. For a given time $t$, we partition $D(f)$ into four disjoint subsets:

\[
\begin{aligned}
Db_t(f) &= \{x \mid x \in D(f) \land \mathit{access\_timestamp}(x) < t \land \mathit{creation\_timestamp}(x) < t\} \\
De_t(f) &= \{x \mid x \in D(f) \land t \le \mathit{access\_timestamp}(x) \le t + \varepsilon \land \mathit{creation\_timestamp}(x) < t\} \\
Da_t(f) &= \{x \mid x \in D(f) \land \mathit{access\_timestamp}(x) > t + \varepsilon \land \mathit{creation\_timestamp}(x) < t\} \\
Di_t(f) &= \{x \mid x \in D(f) \land \mathit{creation\_timestamp}(x) \ge t\}.
\end{aligned}
\]


$\varepsilon$ should be somewhat greater than the expected duration of copying; a good initial value is
1000 seconds. We define a metric $\mathit{Cluster}_t(f)$, indicating the relative size of the cutoff
cluster, and thus the likelihood that folder $f$ was copied at time $t$, as follows:

\[
\mathit{Cluster}_t(f) =
\begin{cases}
0, & \text{if } |Db_t(f)| > 0 \\[4pt]
\dfrac{|De_t(f)|}{|De_t(f)| + |Da_t(f)|}, & \text{otherwise.}
\end{cases}
\]

This metric ranges between 0 and 1, indicating the size of the cutoff cluster, relative to the maximum
size theoretically possible.
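
As a worked illustration with hypothetical counts (not taken from any investigation): suppose that at a candidate time $t$ no folder created before $t$ has an earlier access timestamp, 1800 folders were last accessed within the window from $t$ to $t + \varepsilon$, and 4200 were accessed again later. Then:

```latex
\[
|Db_t(f)| = 0,\quad |De_t(f)| = 1800,\quad |Da_t(f)| = 4200
\;\Longrightarrow\;
\mathit{Cluster}_t(f) = \frac{1800}{1800 + 4200} = 0.3 .
\]
```

If even one folder created before $t$ had an access timestamp earlier than $t$, $|Db_t(f)|$ would be positive and the metric would drop to 0.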


We furthermore define $\mathit{Mag}_t(f)$, indicating the sample size, and thus the confidence of
$\mathit{Cluster}_t(f)$, as follows:

\[
\mathit{Mag}_t(f) =
\begin{cases}
\infty, & \text{if } |Db_t(f)| > 0 \\[4pt]
|De_t(f)| + |Da_t(f)|, & \text{otherwise.}
\end{cases}
\]

$\mathit{Mag}_t(f)$ is on a nominal scale, and is defined to be infinite when $\mathit{Cluster}_t$ is
zero. $\mathit{Mag}_t(f)$ can be interpreted as: the more subfolders $f$ has, the larger our sample, and
the more confident we are.


We can define a second confidence metric as follows. Given a set $S$ of folders, let
$\mathit{Files}(S)$ be the set of all files contained in any folder in $S$. We define a confidence metric
$|\mathit{Abn}_t(f)|$, where

\[
\mathit{Abn}_t(f) = \{x \mid x \in \mathit{Files}(D(f)) \land \mathit{access\_timestamp}(x) < t - \delta\}.
\]

$|\mathit{Abn}_t(f)|$ can be interpreted as: if a large number of files in $f$ have access timestamps
less than $t$, while no subfolders do, we become very confident that $f$ was indeed copied. High values
of $|\mathit{Abn}_t(f)|$ give great confidence in $\mathit{Cluster}_t(f)$, because they show that the
historical file activity is too sparse to have created a cutoff cluster by chance. A good value for
$\delta$ is 10 days; it should be large enough to distinguish the time of the alleged copying from prior
historical behavior. Note that $|\mathit{Abn}_t(f)|$ is only applicable to Windows NTFS and similar
systems that do not update file access timestamps on copying (see Section 2 above). On systems such as
Linux ext3 which update the file access timestamp, we need to substitute $|\mathit{Abn}_t(f')|$, where
$f'$ is a folder similar to $f$ that is known to not have been copied.

In short, these metrics quantitatively measure the cutoff cluster: $\mathit{Cluster}_t(f)$ indicates the
cutoff cluster’s relative size, $\mathit{Cluster}_t(f) \times \mathit{Mag}_t(f)$ its absolute size, and
$|\mathit{Abn}_t(f)|$ its abnormality.
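
The three metrics can be sketched in code as follows. This is a minimal illustration, not the authors' tooling: the `Folder` record, the function names, and the decision to return 0 when both the cluster and after sets are empty are assumptions; timestamps are taken to be Unix epoch seconds, and boundary timestamps are assigned so that the four subsets partition D(f).

```python
import math
from dataclasses import dataclass

@dataclass
class Folder:
    """Hypothetical record for one folder in D(f); times are epoch seconds."""
    access: float
    creation: float

def partition(folders, t, eps=1000.0):
    """Split D(f) into the four disjoint sets Db, De, Da, Di for time t."""
    db, de, da, di = [], [], [], []
    for x in folders:
        if x.creation >= t:
            di.append(x)            # created at or after t
        elif x.access < t:
            db.append(x)            # last accessed before t
        elif x.access <= t + eps:
            de.append(x)            # last accessed within the cluster window
        else:
            da.append(x)            # accessed again after the window
    return db, de, da, di

def cluster(folders, t, eps=1000.0):
    db, de, da, _ = partition(folders, t, eps)
    if db or not (de or da):        # empty denominator treated as 0 (assumption)
        return 0.0
    return len(de) / (len(de) + len(da))

def mag(folders, t, eps=1000.0):
    db, de, da, _ = partition(folders, t, eps)
    return math.inf if db else len(de) + len(da)

def abn(file_access_times, t, delta=10 * 86_400):
    """|Abn_t(f)|: number of files last accessed more than delta before t."""
    return sum(1 for a in file_access_times if a < t - delta)

# Hypothetical data: 3 folders still carry the copy-time stamp, 7 were touched later.
t = 1_000_000
folders = [Folder(t + 100, 0)] * 3 + [Folder(t + 500_000, 0)] * 7
print(cluster(folders, t))               # 0.3
print(mag(folders, t))                   # 10
print(abn([100, 200, 999_000], t))       # 2

# A single folder with an older timestamp rules out copying at time t.
stale = folders + [Folder(t - 1, 0)]
print(cluster(stale, t), mag(stale, t))  # 0.0 inf
```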


6 Field results 

We successfully used these metrics as part of an investigation of suspected data theft. At the time of
the investigation ($t_{\text{investigation}}$), it was suspected that FolderQ had been surreptitiously
copied during a window 30 to 60 days before $t_{\text{investigation}}$. To investigate this, we computed
the metrics on several top level folders, for all $t$ in the range ($t_{\text{investigation}} - 180$ days,
$t_{\text{investigation}}$). $\mathit{Cluster}_t(\text{FolderQ})$ was greater than 0.3 at $t_1$
($\approx t_{\text{investigation}} - 50$ days), with $\mathit{Mag}_t > 5000$ and
$|\mathit{Abn}_t| > 50000$, forensically supporting the suspicion. FolderR also had a non-zero
$\mathit{Cluster}_t$ value at $t_2$ ($\approx t_{\text{investigation}} - 70$ days), which subsequent
investigation determined was due to authorized copying. All other folders examined had zero
$\mathit{Cluster}_t$ values for all $t$ in the range ($t_{\text{investigation}} - 180$ days,
$t_{\text{investigation}}$) (see Table 2).


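                    | FolderQ                          | FolderR                                                         | FolderS       | FolderT       | FolderU
--------------------|----------------------------------|-----------------------------------------------------------------|---------------|---------------|--------------
A priori hypothesis | Suspected of being copied        | Not suspected of being copied                                   | Not suspected | Not suspected | Not suspected
|D(f)|              | ≈ 6000                           | ≈ 7000                                                          | ≈ 800         | ≈ 300         | ≈ 50
Maximum Cluster_t   | > 0.3 (at t = t_1)               | > 0.9 (at t = t_2)                                              | 0             | 0             | 0
Indication          | Copied at t_1                    | Copied at t_2                                                   | Not copied    | Not copied    | Not copied
Mag_t               | > 5000 (t = t_1)                 | > 6000 (t = t_2)                                                | ∞             | ∞             | ∞
|Abn_t|             | > 50000 (t = t_1)                | > 20000 (t = t_2)                                               | > 1500        | > 3000        | > 500
Results             | Suspicion supported forensically | Subsequent investigation determined this copying was authorized | Not copied    | Not copied    | Not copied


Table 2: Metrics applied to field investigation. All values are over range ($t_{\text{investigation}} - 180$
days, $t_{\text{investigation}}$) unless otherwise noted.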


We also plotted histograms of the data: FolderQ and FolderR showed cutoff cluster patterns similar to 
the simulated FolderA shown in Figure 2 above; the other folders did not. Our method thus detected 
copying occurring approximately 2 months beforehand, and demonstrated the absence of copying for 
approximately 6 months beforehand. 
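
Computing the metric "for all t" in a range can be approximated by stepping t across the range on a grid. The sketch below is an assumption about how such a scan might be organized, not a description of the tooling used in the investigation; the step size is hypothetical and should be smaller than ε so that a cluster cannot fall between grid points. Folders are given as (access, creation) pairs in epoch seconds.

```python
DAY = 86_400

def cluster(folders, t, eps):
    """Cluster_t for a list of (access, creation) pairs in epoch seconds."""
    eligible = [a for a, c in folders if c < t]
    if any(a < t for a in eligible):
        return 0.0                      # some folder predates t: no cutoff here
    in_window = sum(1 for a in eligible if a <= t + eps)
    return in_window / len(eligible) if eligible else 0.0

def scan(folders, t_start, t_end, step, eps=1000.0):
    """Return (t, Cluster_t) for the first grid point with the largest value."""
    best_t, best = None, 0.0
    t = t_start
    while t <= t_end:
        value = cluster(folders, t, eps)
        if value > best:
            best_t, best = t, value
        t += step
    return best_t, best

# Hypothetical folder: 30 subfolders stamped about 50 days before the
# investigation, 70 touched again about 10 days before it.
t_inv = 200 * DAY
folders = [(150 * DAY + 60, 0)] * 30 + [(190 * DAY, 0)] * 70
best_t, best = scan(folders, t_inv - 180 * DAY, t_inv, step=500)
print(best, (t_inv - best_t) / DAY)
```

A non-zero maximum identifies a candidate copying time, which an investigator would then corroborate by other means.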


7 Distinguishing different forms of recursive, nonselective access 

These metrics can identify folder copying and distinguish it from routine activity. Besides folder copying,
there are other types of recursive, nonselective access, such as searching folders for particular files,
scanning them for viruses, or even using the POSIX ls -lR command to generate a recursive directory
listing. While we have not yet used our method to distinguish between these activities, we’re investigating
doing so via these fingerprinting characteristics:

• File access. Are all, some, or none of the file access timestamps updated? Copying, depending on
the system, updates either all files or only folders (see Section 2 above), whereas virus scanning
may update only certain types of files (e.g. executable), and searching typically updates only a
subset of files having a common subsequence in their name.

• Skipped folders and files. What types of folders and files are skipped? Possibilities include ones
beginning with periods, NTFS Alternate Data Streams, NTFS hidden files, NTFS system files,
Windows Thumbs.db, and OS X .DS_Store.

• Tree traversal method. Is the recursion performed breadth first, depth first, or in another order? 

• Sibling visit order. What order are siblings visited in? Filesystem order may be the most common,
but alphabetical or other orders may be used as well. When a folder contains both files and
subfolders, is one accessed before the other?

• Speed. At what rate are folders and files accessed? Does the rate depend on the number of entries?
On the size of files? It should be noted that a copy command may recursively enumerate all
descendants of a subfolder before copying any of them, and so the timestamp updates may happen
much faster than the actual copying.
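
One of these characteristics, sibling visit order, could be examined along the following lines. This is a hypothetical sketch of an analysis the text describes as still under investigation: the function name and input format are assumptions, the name comparison is a plain case-sensitive string sort, and coarse timestamp resolution may make many siblings indistinguishable.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def sibling_order_is_alphabetical(access_by_path):
    """For each parent, test whether its subfolders were visited in name order.

    access_by_path: dict mapping folder path -> access timestamp, restricted
    to folders inside the cutoff cluster. Returns dict parent -> bool.
    Siblings with identical timestamps are treated as consistent with any order.
    """
    siblings = defaultdict(list)
    for path, ts in access_by_path.items():
        p = PurePosixPath(path)
        siblings[str(p.parent)].append((p.name, ts))
    result = {}
    for parent, kids in siblings.items():
        if len(kids) < 2:
            continue                    # a lone child carries no ordering information
        kids.sort()                     # alphabetical by name
        times = [ts for _, ts in kids]
        result[parent] = all(a <= b for a, b in zip(times, times[1:]))
    return result

# Hypothetical timestamps (seconds since the start of the cluster).
stamps = {"/proj/a": 10, "/proj/b": 11, "/proj/c": 12,
          "/proj/c/x": 14, "/proj/c/y": 13}
print(sibling_order_is_alphabetical(stamps))
```

Here the top-level siblings are consistent with alphabetical traversal while the children of `/proj/c` are not, which would argue against a tool that always visits siblings in name order.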

We should note that our informal experience is that system activity, such as Windows Volume Snapshot
Service, as well as most (but not all) backup software, does not modify access timestamps, and hence is
irrelevant to the above metrics. Likewise, modern file searches, which may use indexes such as Windows
Search Services (Microsoft Corporation, 2010b), do not necessarily access the individual folders and files
being searched. We have found, however, that graphical shells (such as Microsoft Windows Explorer)
automatically access various well known User Profile folders (such as Documents and Settings\<User Name>)
in ways which we have not fully explored; further research is required before applying these metrics to
such folders.

Like any forensic method, cutoff clusters are a component of an investigation, not a replacement. The 
absence of a cutoff cluster provides strong evidence that copying has not taken place. When a cutoff 
cluster is found, an investigator will use other means, both digital and human, to investigate its time 
and circumstances. A cutoff cluster occurring in the middle of a workday, caused by an employee who 
frequently uses the Unix command line, with no suspicious activity occurring at the time, may most 
likely be due to a Unix ls -lR or the like. The investigator will attempt to confirm this by finding other
cutoff clusters occurring regularly in folders used by that employee. In another case, a cluster occurring 
late at night may prompt examination of the building exit records, which show that an employee who 
usually leaves at 5 PM stayed late that night for no apparent reason. Further investigation may show that 
employee had just previously expressed anger at a poor performance review. This may prompt a forensic 
examination of that employee’s PC, revealing further information as the investigation progresses. 


8 Future work 

We are exploring the following improvements to our method:

• We would like to perform scientific trials of our method, evaluating how accurately it detects copying
in a blinded test on real world filesystems (see Garfinkel et al., 2009). However, benchmarking
this requires external a priori knowledge of if and when copying took place, which the standard 
corpora (such as DigitalCorpora.org) lack. In general, this problem often makes such corpora, 
without their accompanying histories, unsuitable for scientifically testing real world phenomena. 

• Our metrics are currently on a nominal scale. We are working to incorporate a model of expected
routine filesystem activity to yield a true probability value.

• Our metric currently assumes that folders can be represented as a tree. Many real world filesystems 
cannot be represented as a tree, because they allow a folder to have multiple parents, such as 
through symbolic links or Windows 7 Libraries (Kiriaty, 2009; Microsoft Corporation, 2010c). 
We feel our metric could be extended to these as well. 

• Before computing the metrics, certain folders which may be omitted from copying (such as hidden 
or permission restricted folders) must be prefiltered from D(f). Since these folders are relatively 
rare, we feel we could dispense with the need to manually prefilter by using a fuzzy threshold for 
Db t , instead of a steep cutoff at zero. 

• Finally, we think there are other activities that, although they may fail to deterministically create 
identifiably unique artifacts, nonetheless result in stochastically emergent patterns. We feel that 
stochastic forensics may enable investigation of these otherwise silent activities. 
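
The fuzzy threshold mentioned in the list above might look like the following sketch. The `tolerance` parameter and the function name are hypothetical; the idea is only that a small fraction of folders with older timestamps (for instance, hidden or permission restricted folders skipped by the copy) no longer forces the metric to zero. Folders are (access, creation) pairs in epoch seconds.

```python
def fuzzy_cluster(folders, t, eps=1000.0, tolerance=0.01):
    """Cluster_t variant that tolerates a small fraction of folders in Db_t.

    tolerance: largest fraction of eligible folders allowed to predate t
    (hypothetical parameter; 0.0 reproduces the strict cutoff).
    """
    eligible = [a for a, c in folders if c < t]
    if not eligible:
        return 0.0
    before = sum(1 for a in eligible if a < t)
    if before / len(eligible) > tolerance:
        return 0.0
    in_window = sum(1 for a in eligible if t <= a <= t + eps)
    after = sum(1 for a in eligible if a > t + eps)
    return in_window / (in_window + after) if in_window + after else 0.0

# Hypothetical folder: one skipped subfolder, 30 at the copy time, 69 touched later.
folders = [(500, 0)] + [(10_000, 0)] * 30 + [(50_000, 0)] * 69
print(fuzzy_cluster(folders, 10_000, tolerance=0.0))   # strict cutoff: 0.0
print(fuzzy_cluster(folders, 10_000, tolerance=0.02))  # skipped folder tolerated
```

Any such tolerance trades false negatives for false positives, so a suitable value would have to be established empirically.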


9 Experimenting with access timestamps 

In the course of our experiments, we’ve found access timestamp behavior to be quite mercurial. Here are 
the experimental pitfalls we encountered and solutions. 


• Systems may, for performance reasons, decline to update an access timestamp. Since maintaining 
accurate access timestamps may involve substantial performance costs, and isn’t deemed system 
critical, systems may decline to update them. In many systems, this is user configurable (Microsoft 
Corporation, 2003a). In particular, some versions of Microsoft Windows ship configured to disable 
access timestamp updates (Carvey, 2009, p. 205). Complicating things further, some systems may 
selectively update the timestamps, for instance updating only when the newer timestamp differs 
from the previous one by a certain threshold. 

The recommended solution is to check system configuration and documentation before experimenting,
and to exhaustively observe system behavior under different scenarios.

• Systems may, for performance reasons, defer writing updates of access timestamps to the filesystem.
Even when filesystems do maintain accurate access timestamps, they may cache the updates
in memory before writing them to a disk (Microsoft Corporation, 2003b). Thus, if a filesystem is
examined before a system has been shut down properly, its access timestamps may not be accurate.

• Systems may report updated access timestamps even before writing them to disk. In cases when
updates are deferred, queries to the system for access time may return the updated value stored in
memory, even though it has not yet been written to disk. Thus, an experimenter may find one
value if he queries the operating system, and another value if he directly examines the filesystem.

• Querying a file’s access timestamp may itself update it. For instance, we have found that using
Windows Explorer to display a file’s access timestamp will cause the access timestamp to be
updated to the current time.


These last three problems can be solved by not using the standard operating system facilities to query 
access time, but instead shutting the operating system down normally and then directly examining the 
filesystem image using specialized tools. Admittedly, this makes experimentation cumbersome. 
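
As one example of checking system configuration before experimenting, the sketch below looks for mount options that affect access timestamp updates on a Linux system. It is an illustration under stated assumptions: it applies only to Linux, it parses text in the format of `/proc/mounts`, and the function name is hypothetical. The option names themselves (`noatime`, `nodiratime`, `relatime`, `strictatime`, `lazytime`) are standard Linux mount options.

```python
ATIME_OPTIONS = {"noatime", "nodiratime", "relatime", "strictatime", "lazytime"}

def atime_options(mounts_text):
    """Map each mount point to the atime-related options it is mounted with.

    mounts_text: contents of /proc/mounts (device, mount point, fstype,
    comma-separated options, then two numeric fields per line).
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue                    # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        result[mount_point] = sorted(ATIME_OPTIONS.intersection(options))
    return result

# Hypothetical excerpt; on a live system one would read open("/proc/mounts").
example = (
    "/dev/sda1 / ext4 rw,relatime,errors=remount-ro 0 0\n"
    "/dev/sdb1 /data ext4 rw,noatime 0 0\n"
)
print(atime_options(example))
```

A filesystem mounted with `noatime` will not record the access pattern this method depends on, and `relatime` updates access times only selectively, so either finding should temper any conclusions drawn from that volume.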


10 Conclusions 

As noted, copying of data has no known artifacts. Nonetheless, we can reliably detect emergent patterns
unique to copying, even months after its occurrence. Statistical mechanics, which treats objects as
individually unpredictable and looks for patterns which nonetheless emerge stochastically, gives us
insight beyond the classical laws from which it derives. Similarly, we believe stochastic forensics
provides us with means to analyze hitherto undetectable activity.